Abstract

In this paper, we examine the maximization of energy efficiency (EE) in next-generation multi-user MIMO-OFDM 
networks that evolve dynamically over time - e.g. due to user mobility, fluctuations in the wireless medium, modulations 
in the users’ load, etc. Contrary to the static/stationary regime, the system may evolve in an arbitrary manner, so targeting 
a fixed optimum state (either static or in the mean) becomes obsolete; instead, users must adjust to changes in the system 
“on the fly”, without being able to predict the state of the system in advance. To tackle these issues, we propose a simple 
and distributed online optimization policy that leads to no regret, i.e. it allows users to match (and typically outperform) 
even the best fixed transmit policy in hindsight, irrespective of how the system varies with time. Moreover, to account 
for the scarcity of perfect channel state information (CSI) in massive MIMO systems, we also study the algorithm’s 
robustness in the presence of measurement errors and observation noise. Importantly, the proposed policy retains its 
no-regret properties under very mild assumptions on the error statistics and, on average, it enjoys the same performance 
guarantees as in the noiseless, deterministic case. Our analysis is supplemented by extensive numerical simulations 
which show that, in realistic network environments, users track their individually optimum transmit profile even under 
rapidly changing channel conditions, achieving gains of up to 600% in energy efficiency over uniform power allocation 
policies. 


Index Terms 

Energy efficiency; imperfect CSI; MIMO; OFDM; no regret; online optimization. 

I. Introduction 

The wildfire spread of Internet-enabled mobile devices and the exponential growth of bandwidth-hungry applications 
are putting existing wireless systems under enormous strain and are among the driving forces behind the transition 
to fifth generation (5G) mobile networks [1]. In this way, the ICT industry is faced with a formidable mission: data 
rates must be increased significantly in order to meet the soaring demand for wireless broadband, but this task must 
be accomplished under an extremely tight energy budget. Thus, to achieve the seamless integration of a diverse set 
of mobile users, applications and services, current design requirements for 5G systems target a dramatic decrease in 
energy-per-bit consumption of the order of 1,000x or more [2, 3]. 

A contending technology to achieve these design targets is the emerging massive MIMO (multiple-input and 
multiple-output) paradigm. Coupled with the use of multiple carrier frequencies via orthogonal frequency division 
multiplexing (OFDM), massive MIMO “goes large” by employing inexpensive service antennas to focus energy into 
ever smaller regions of space [4-6]. As a result, very large MIMO arrays can greatly enhance the reliability of wireless 
connections and increase throughput and energy efficiency (EE) by a factor of 10x to 100x without requiring the 
deployment of expensive new air interfaces [1, 6]. However, due to the massive complexity and variability of such 
systems, a crucial challenge that arises is that wireless users must also be capable of adapting to a dynamic spectrum 
landscape “on the fly”, usually with minimal coordination and limited information at the device end. 

An added challenge in the above considerations is that wireless users often do not have access to perfect channel state 
information (CSI) and co-channel interference (CCI) measurements, especially at the transmitter end - for instance, 
due to pilot contamination in massive MIMO systems [6]. In particular, if the system operates in the presence of uncertainty 
(imperfect CSI, observation noise, etc.), optimization techniques that rely on a greedy, “one-off” calculation 
of optimal transmit characteristics (such as water-filling) are no longer suitable because stochastic fluctuations could 
lead the system to a suboptimal state [7, 8]. On that account, our main objective in this paper will be to provide an 
adaptive transmit policy for energy efficiency maximization in dynamic MIMO-OFDM networks that are subject to 
uncertainty, measurement errors and/or other unpredictable changes in the wireless medium. 

In the general context of MIMO-OFDM systems, the vast majority of works on energy efficiency maximization and 
energy-efficient power allocation have focused on two limit cases [9]. In the static regime [10-15], the attributes of the 
wireless system under study (channel gains, user load, etc.) are assumed effectively static and the system’s analysis 
revolves around techniques from the theory of non-cooperative games and optimization (continuous or discrete). At 
the other end of the spectrum, in the ergodic regime [12, 16], the wireless medium is assumed to evolve over a very 
fast time scale, typically following a sequence of independent and identically distributed (i.i.d.) random variables; 
consequently, the figure of merit in problems of this type is the stochastic average of the users’ energy efficiency 
function. All these works study the trade-off between the Shannon achievable rate and power consumption either for a 
single user (via fractional programming) or multiple ones (using the theory of non-cooperative games). Finally, in the 
static channel regime, [17-22] consider a throughput model that depends on the connection’s bit error rate (BER) and 
use tools from game theory to characterize the system’s stable (equilibrium) states. 

In this paper, we focus squarely on dynamic MIMO-OFDM systems that evolve arbitrarily over time (e.g. due to 
channel variability, fading, mobility, etc.), and we make no statistical hypotheses regarding the dynamics that govern 
the network’s evolution (such as stationarity or ergodicity). As opposed to the stationary/ergodic regime discussed 
above, static solution concepts such as Nash/correlated equilibria are no longer relevant because there is no underlying 
target state to attain (either static or in the mean); as such, no conclusions can be drawn from the existing literature on 
energy-efficient power allocation. Instead, users have to optimize their transmit characteristics on the fly, based only on 
locally available information of the past state of the system, and hoping to track (or at least emulate the performance 
of) the a posteriori optimum transmit policy. 

The most widely used optimization criterion in this setting is that of regret minimization, a seminal notion which 
was first introduced by Hannan [23] and which has since given rise to a vigorous literature at the interface of machine 
learning, optimization, statistics, and game theory - for a comprehensive survey, see e.g. [24-26]. More precisely, in 
the language of game theory, a user’s (cumulative) regret over a given time horizon is simply the difference between his 
average payoff (over the time horizon in question) and the payoff that he would have obtained if he had employed the 
best possible fixed action in hindsight. Accordingly, in our case, regret minimization corresponds to dynamic transmit 
policies that are asymptotically optimal in hindsight, irrespective of how the users’ effective wireless medium evolves 
over time. 

A regret-based approach was recently employed by the authors of [27] who studied the problem of power control 
in infrastructureless wireless networks and proposed an algorithm that minimizes the users’ (internal) regret to attain 
the system’s equilibrium. In a similar vein, [28] studied the transient phase of the Foschini-Miljanic (FM) power 
control algorithm in static environments and used the notion of swap regret [29] to propose alternative convergent 
power control schemes; even more recently, [30] showed that the FM dynamics lead to no regret, so they retain 
their optimality properties in dynamic environments. Finally, [31] and [32] used online optimization techniques and 
a methodology based on matrix exponential learning [7, 8, 33, 34] to derive a no-regret adaptive transmit policy for 
power control and throughput maximization in cognitive radio networks respectively. However, the proposed policies 
drive wireless users to transmit at either full or minimum power (subject to their rate requirements), so they cannot be 
applied to minimize energy-per-bit consumption in dynamic MIMO-OFDM systems. 

Summary of results and paper outline 

In this paper, we formulate the maximization of energy efficiency in dynamic MIMO-OFDM systems as an online 
semidefinite optimization problem and, drawing on Zinkevich’s seminal online gradient ascent (OGA) methodology 
[35], we propose an adaptive transmit policy which is asymptotically optimal in hindsight - i.e. that leads to no 
regret. In particular, we show that the proposed algorithm guarantees an $O(T^{-1/2})$ regret bound after $T$ update epochs 
(transmission frames), and this bound tightens to $O(\log T / T)$ if the users’ channel gains always remain above a given 
level. Furthermore, to address the lack of perfect measurements and channel state information at the transmitter (CSIT), 
we show that the proposed algorithm retains its optimality properties under very mild statistical hypotheses that are 
satisfied by the vast majority of error distributions. Specifically, as long as a) there is no systematic error in the 
measurement process; and b) the probability of observing very large errors ($z$) is not higher than $O(1/z^2)$, the proposed 
policy leads to no regret and enjoys a mean bound of the same order as in the deterministic case. 

The performance of the proposed transmit policy is validated by means of extensive numerical simulations modeling 
a cellular orthogonal frequency-division multiple access (OFDMA) network with multiple base stations and mobile 
MIMO users with realistic wireless propagation, fading and mobility features. Our results show that the proposed 
policy represents a scalable and flexible method that allows users to attain very high energy efficiency levels, with 
gains of up to 600% over uniform/fixed power allocation policies and with surprisingly modest feedback requirements. 

Our work here greatly extends our recent conference paper [36] where we derived a continuous-time exponential 
learning method for energy efficiency maximization in dynamic single-input and single-output (SISO) systems. Compared 
to [36], the current paper provides a bona fide learning algorithm for multiple-antenna systems, with discrete-time 
updates and performance guarantees, and with proven robustness in the presence of uncertainty and observation 
noise. 

The rest of our paper is structured as follows: in Section II, we present our wireless system model and we formulate 
the problem of dynamic energy efficiency maximization as an online semidefinite program. In Section III, we derive 
our online learning policy, and we establish its no-regret properties and performance guarantees under both perfect 
and imperfect CSI. Finally, our theoretical analysis is supplemented by extensive numerical simulations in Section IV 
where we illustrate the gains of the proposed policy under realistic channel gain and mobility conditions. 


II. System Model and Problem Formulation 


Consider a wireless network consisting of several point-to-point connections $u \in \mathcal{U} = \{1, \dots, U\}$ (the system’s 
users) that are established over a set of orthogonal subcarriers $k \in \mathcal{K} = \{1, \dots, K\}$. Each connection $u \in \mathcal{U}$ represents 
a pair of communicating wireless multi-antenna devices with $M_u$ antennas at the transmitter and $N_u$ antennas at the 
receiver. Thus, focusing on the uplink case, if $\mathbf{x}_k^u \in \mathbb{C}^{M_u}$ and $\mathbf{y}_k^u \in \mathbb{C}^{N_u}$ denote the signals transmitted and received over 
connection $u$ on subcarrier $k$, we obtain the familiar baseband signal model: 


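\[
\mathbf{y}_k^u = \mathbf{H}_k^{uu} \mathbf{x}_k^u + \sum_{u' \neq u} \mathbf{H}_k^{u'u} \mathbf{x}_k^{u'} + \mathbf{z}_k^u,
\tag{1}
\]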


where $\mathbf{H}_k^{u'u} \in \mathbb{C}^{N_u \times M_{u'}}$ denotes the transfer matrix between the $u'$-th transmitter and the $u$-th receiver over subcarrier $k$, 
while $\mathbf{z}_k^u$ is the ambient noise over the channel (including thermal and atmospheric effects, and modeled as a circularly 
symmetric Gaussian complex vector). In this way, the multi-user interference-plus-noise (MUI) at the intended receiver 
of the $u$-th connection will be: 

\[
\mathbf{w}_k^u = \sum_{u' \neq u} \mathbf{H}_k^{u'u} \mathbf{x}_k^{u'} + \mathbf{z}_k^u,
\tag{2}
\]


so (1) may be written more simply as: 


\[
\mathbf{y}_k^u = \mathbf{H}_k^{uu} \mathbf{x}_k^u + \mathbf{w}_k^u.
\tag{3}
\]


In the rest of this paper, we will focus on a specific connection $u \in \mathcal{U}$ and we will treat the MUI vector $\mathbf{w}_k^u$ as an 
aggregate noise variable whose covariance depends on the wireless medium and the transmit characteristics of all other 
users. As such, if we drop the user index u for notational convenience, the signal model (3) attains the more compact 
form: 

\[
\mathbf{y}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{w}_k.
\tag{4}
\]


Hence, assuming Gaussian input and single user decoding (SUD) at the receiver, the Shannon rate at the focal 
connection will be given by the well-known expression [37]:* 

\[
R(\mathbf{Q}) = \sum_{k \in \mathcal{K}} \left[ \log\det\left( \mathbf{W}_k + \mathbf{H}_k \mathbf{Q}_k \mathbf{H}_k^\dagger \right) - \log\det \mathbf{W}_k \right],
\tag{5}
\]

*For the sake of simplicity, constant multiplicative factors such as the bandwidth of the connection have been dropped in (5); these factors are 
reinstated in the numerical analysis of Section IV. 

where:

1) $\mathbf{Q}_k = \mathbb{E}[\mathbf{x}_k \mathbf{x}_k^\dagger] \in \mathbb{C}^{M \times M}$ is the user’s input signal covariance matrix over subcarrier $k$.²

2) $\mathbf{Q} = \operatorname{diag}(\mathbf{Q}_1, \dots, \mathbf{Q}_K)$ is the power profile of the focal user over all subcarriers. 

3) $\mathbf{W}_k = \mathbb{E}[\mathbf{w}_k \mathbf{w}_k^\dagger] \in \mathbb{C}^{N \times N}$ is the MUI covariance matrix of the co-channel interference plus noise affecting the focal 
connection (obviously, $\mathbf{W}_k$ depends on all other users in the network). 

Remark 1. The Gaussian input and noise assumptions are fairly standard in the literature: in particular, Gaussian 
noise is known to be the worst additive noise distribution with respect to the Shannon achievable rate [38] while 
Gaussian input is optimal against a Gaussian environment [37]. Finally, regarding the decoding technique, SUD has 
the advantage of being simple, distributed, and scalable as it does not require any coordination or signaling among the 
interfering users. 

In view of the above, if we let 

\[
\tilde{\mathbf{H}}_k = \mathbf{W}_k^{-1/2} \mathbf{H}_k
\tag{6}
\]


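denote the effective channel matrix of the focal user over subcarrier $k$, the user’s Shannon rate (5) can be written more 
concisely as: 

\[
R(\mathbf{Q}) = \sum_{k} \log\det\left( \mathbf{I} + \tilde{\mathbf{H}}_k \mathbf{Q}_k \tilde{\mathbf{H}}_k^\dagger \right) = \log\det\left( \mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger \right),
\tag{7}
\]

where $\tilde{\mathbf{H}} = \operatorname{diag}(\tilde{\mathbf{H}}_1, \dots, \tilde{\mathbf{H}}_K)$ is the block-diagonal sum of the user’s effective channel matrices over all subcarriers. 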
Thus, following [11-14], the user’s energy efficiency function is defined as his Shannon rate per unit of consumed 
power, i.e. 


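\[
\mathrm{EE}(\mathbf{Q}) = \frac{R(\mathbf{Q})}{P_c + \operatorname{tr}(\mathbf{Q})} = \frac{\log\det\left( \mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger \right)}{P_c + \operatorname{tr}(\mathbf{Q})},
\tag{8}
\]

where $\operatorname{tr}(\mathbf{Q}) = \sum_k \operatorname{tr}(\mathbf{Q}_k)$ is the user’s total transmit power while $P_c$ denotes the total power dissipated in all other circuit 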
components of the transmitting device (mixer, frequency synthesizer, digital-to-analog converter, etc.). This efficiency 
function (which, formally, has units of bits/Joule) has been widely studied in the literature [12, 19, 39] and it captures 
the fundamental trade-off between higher spectral efficiency and increased battery life. Consequently, in the context of 
power-limited, energy-aware users, we obtain the maximization problem: 


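\[
\begin{aligned}
&\text{maximize} && \mathrm{EE}(\mathbf{Q}), \\
&\text{subject to} && \mathbf{Q} \in \mathcal{Q},
\end{aligned}
\tag{9}
\]

where 

\[
\mathcal{Q} = \left\{ \operatorname{diag}(\mathbf{Q}_1, \dots, \mathbf{Q}_K) : \mathbf{Q}_k \succeq 0, \; \sum\nolimits_k \operatorname{tr}(\mathbf{Q}_k) \leq P_{\max} \right\},
\tag{10}
\]

and $P_{\max}$ denotes the user’s maximum transmit power. 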
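As an illustration of the quantities defined so far, the following minimal sketch evaluates the rate (5), the effective channels (6) and the energy efficiency (8) numerically, and checks that (5) and (7) agree. The function names, the random test data and the values of `K`, `M`, `N` and `P_c` are assumptions made for this example only; rates are in nats and the bandwidth factor is omitted, as in (5).

```python
import numpy as np

def shannon_rate(H_blocks, Q_blocks, W_blocks):
    """Rate (5): sum_k [log det(W_k + H_k Q_k H_k^†) - log det W_k], in nats."""
    total = 0.0
    for H, Q, W in zip(H_blocks, Q_blocks, W_blocks):
        total += np.linalg.slogdet(W + H @ Q @ H.conj().T)[1]
        total -= np.linalg.slogdet(W)[1]
    return total

def effective_channel(H, W):
    """Effective channel (6): W^{-1/2} H, using the eigendecomposition of W."""
    evals, evecs = np.linalg.eigh(W)
    W_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.conj().T
    return W_inv_sqrt @ H

def energy_efficiency(Heff_blocks, Q_blocks, P_c):
    """EE (8): rate (7) divided by the total consumed power P_c + tr(Q)."""
    rate, power = 0.0, P_c
    for Heff, Q in zip(Heff_blocks, Q_blocks):
        G = np.eye(Heff.shape[0]) + Heff @ Q @ Heff.conj().T
        rate += np.linalg.slogdet(G)[1]
        power += np.trace(Q).real
    return rate / power

# Hypothetical test data: K subcarriers, M transmit and N receive antennas.
rng = np.random.default_rng(0)
K, M, N, P_c = 3, 2, 4, 1.0
H = [rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M)) for _ in range(K)]
G = [rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N)) for _ in range(K)]
W = [g @ g.conj().T + np.eye(N) for g in G]          # Hermitian, positive-definite
Q = [0.1 * np.eye(M, dtype=complex) for _ in range(K)]

Heff = [effective_channel(h, w) for h, w in zip(H, W)]
rate_5 = shannon_rate(H, Q, W)
ee_8 = energy_efficiency(Heff, Q, P_c)
total_power = P_c + sum(np.trace(q).real for q in Q)
print(rate_5, ee_8 * total_power)   # the two numbers should coincide
```

The whitening step in `effective_channel` is what allows the interference covariance to be absorbed into the channel, so that only the effective matrices need to be tracked afterwards.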

²In the above, expectations are taken over the users’ codebooks (assumed Gaussian). 

Of course, the user’s energy efficiency function depends not only on the transmitter’s signal covariance profile $\mathbf{Q}$, 
but also on the transmit characteristics of all other users via the effective channel matrices $\tilde{\mathbf{H}}_k$: in particular, $\tilde{\mathbf{H}}$ collects 
all sources of noise and interference that cannot be controlled by the focal transmit/receive pair, so the user’s energy 
efficiency objective may itself vary over time in an unpredictable way. On that account, since we wish to focus on 
dynamic networks that evolve in an arbitrary fashion, we will not be making any specific postulates regarding the 
behavior of other users in the network and/or the evolution of the user’s actual channel matrix $\mathbf{H}$. The only generic 
assumptions that we will make regarding the effective channel matrices $\tilde{\mathbf{H}}$ are: 

(A1) $\tilde{\mathbf{H}}$ remains bounded over the entire transmission horizon (e.g. due to the minimum distance between transmitter 
and receiver, RF circuit losses, antenna directivity, etc.). 

(A2) The variability of $\tilde{\mathbf{H}}$ within each transmission frame is sufficiently slow so that the standard caveats of information 
theory remain valid. 

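Consequently, if $\tilde{\mathbf{H}}(t)$ is the user’s effective channel matrix at time $t$, we obtain the following online energy efficiency 
problem: 

\[
\begin{aligned}
&\text{maximize} && \mathrm{EE}(\mathbf{Q}; t), \\
&\text{subject to} && \mathbf{Q} \in \mathcal{Q},
\end{aligned}
\tag{OEE}
\]

where, in obvious notation: 

\[
\mathrm{EE}(\mathbf{Q}; t) = \frac{\log\det\left( \mathbf{I} + \tilde{\mathbf{H}}(t) \mathbf{Q} \tilde{\mathbf{H}}^\dagger(t) \right)}{P_c + \operatorname{tr}(\mathbf{Q})}
\tag{11}
\]

denotes the user’s energy efficiency function at time $t$. Thus, given that the user cannot predict the state of the system 
ahead of time, we will focus on the following sequence of events: 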

1) At each update period $n = 1, 2, \dots$, the user selects a transmit power profile $\mathbf{Q}(n) \in \mathcal{Q}$. 

2) The user’s energy efficiency over the current period is determined by the effective channel matrix $\tilde{\mathbf{H}}(n)$ at the time 
of transmission. 

3) At the end of the period, the user selects a new signal covariance profile Q(n + 1) seeking to maximize his a priori 
unknown objective function EE(Q; n + 1), and the process repeats. 

Of course, the key challenge in this dynamic framework is that the user does not know ahead of time the effective 
channel matrix $\tilde{\mathbf{H}}(n + 1)$ that determines his energy efficiency function at stage $n + 1$, so he must try to adapt to 
the changing network conditions “on the fly”. To be sure, if the user had perfect foresight and knowledge of the 
evolution of $\tilde{\mathbf{H}}(n)$ in advance, the (fixed) power profile that maximizes the user’s average energy efficiency over a 
given transmission horizon $T$ would be the solution to the (offline) maximization problem: 

\[
\max_{\mathbf{Q} \in \mathcal{Q}} \; \frac{1}{T} \sum_{n=1}^{T} \mathrm{EE}(\mathbf{Q}; n).
\tag{12}
\]

Obviously however, this “oracle” solution cannot be computed without precognitive abilities, so we will focus on 
adaptive transmit policies Q(n) that approach the maximal value of (12) asymptotically, irrespective of the system’s 
evolution over time. 

To make this analysis precise, we define the user’s (cumulative) regret at time T as the cumulative difference between 
the user’s achieved EE and the solution of the maximization problem (12), i.e. we let: 

\[
\mathrm{Reg}(T) = \max_{\mathbf{Q} \in \mathcal{Q}} \sum_{n=1}^{T} \left[ \mathrm{EE}(\mathbf{Q}; n) - \mathrm{EE}(\mathbf{Q}(n); n) \right].
\tag{13}
\]

We then say that a dynamic transmit policy $\mathbf{Q}(n)$ leads to no regret if 

\[
\limsup_{T \to \infty} T^{-1} \mathrm{Reg}(T) \leq 0, \quad \text{or, equivalently:} \quad \mathrm{Reg}(T) = o(T),
\tag{14}
\]

independently of the evolution of the user’s energy efficiency function. In this way, a no-regret policy $\mathbf{Q}(n)$ is asymptotically 
optimal in hindsight in that it provides an asymptotic solution to the average energy efficiency maximization 
problem (12), without requiring any oracle-like capabilities from the user. 
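To make the definition of regret concrete, the sketch below evaluates (13) in a scalar toy setting (one subcarrier, one antenna), where the maximum over fixed policies is approximated by a grid search. The gain sequence, the grid and the function names are hypothetical choices for illustration; the grid search only approximates the true maximum in (13).

```python
import math

def ee(p, g, P_c=1.0):
    """Scalar energy efficiency log(1 + g p) / (P_c + p), in nats per unit power."""
    return math.log(1.0 + g * p) / (P_c + p)

def cumulative_regret(played, gains, grid):
    """Reg(T) as in (13), with the max over fixed policies restricted to `grid`."""
    achieved = sum(ee(p, g) for p, g in zip(played, gains))
    best_fixed = max(sum(ee(q, g) for g in gains) for q in grid)
    return best_fixed - achieved

# Hypothetical scenario: the effective gain alternates between two values.
grid = [i / 100 for i in range(1, 301)]                  # candidate powers in (0, 3]
gains = [4.0 if n % 2 == 0 else 0.5 for n in range(50)]

fixed_play = [1.0] * len(gains)                          # an arbitrary fixed policy
oracle_play = [max(grid, key=lambda q, g=g: ee(q, g)) for g in gains]

reg_fixed = cumulative_regret(fixed_play, gains, grid)
reg_oracle = cumulative_regret(oracle_play, gains, grid)
print(reg_fixed, reg_oracle)
```

A fixed policy can never have negative regret against the best fixed policy, whereas a policy that could foresee the instantaneous optimum at every stage would have non-positive regret.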

The seminal notion of regret was first introduced in a game-theoretic setting by Hannan [23] and it has since given 
rise to a vast corpus of research at the interface of optimization, statistics and machine learning - for a recent survey, 
see e.g. [24, 25]. In particular, if the user’s energy efficiency function does not vary with time (i.e. if the user’s effective 
channels are static), standard arguments from the theory of online optimization [24] can be used to show that no-regret 
policies converge to the set $\mathcal{Q}^* = \arg\max_{\mathbf{Q}} \mathrm{EE}(\mathbf{Q})$ of maximally energy-efficient power profiles that solve the (static) 
problem (9). Likewise, if the user could somehow predict an instantaneous optimum policy $\mathbf{Q}^*(n) \in \arg\max_{\mathbf{Q}} \mathrm{EE}(\mathbf{Q}; n)$ 
ahead of every stage $n = 1, 2, \dots, T$, we would have $\mathrm{Reg}(T) \leq 0$ for all $T$; by this token, the no-regret requirement 
(14) is a crucial indicator that $\mathbf{Q}(n)$ tracks the optimal solution of (OEE) as it evolves over time. The quality of this 
tracking can be quantified by more sophisticated regret notions such as adaptive [40] or shifting [41] regret. We focus 
here on the simpler case of external regret minimization due to space limitations; however, in Section IV, we explore 
this issue via extensive numerical simulations. 

Remark. We should also note here that the no-regret property (14) is a “worst-case” guarantee that carries no assumptions 
on the evolution of the user’s environment over time: the user’s channels could evolve randomly (following some 
stationary, ergodic process, as in the case of fast-fading), adversarially (e.g. if the user is subject to jamming), or not 
at all (in the static regime). As such, in the special case where the wireless medium is affected only by the behavior of 
other users in the network, a natural question that arises is whether the use of a no-regret policy by all users leads to 
an equilibrium of the underlying game.³ We address this issue in more detail in Section IV. 

III. Online Learning 

A first idea to achieve no regret in the online energy efficiency maximization problem (OEE) would be to calculate at 
each stage the power profile that maximizes energy efficiency based on the latest available information at the previous 
stage. However, as can be seen by a standard online optimization argument, this policy may lead to positive regret: for 
instance, when the user’s channel alternates every other period between two values - say $\tilde{\mathbf{H}}_1$ and $\tilde{\mathbf{H}}_2$, with corresponding 
optimal power profiles $\mathbf{Q}_1^*$ and $\mathbf{Q}_2^*$ - best-responding to the last observed system state performs strictly worse than the 
fixed policy $(\mathbf{Q}_1^* + \mathbf{Q}_2^*)/2$ [24]. With this in mind, we propose in this section an adaptive power allocation policy that 
utilizes all past information in a recursive way based on Zinkevich’s seminal OGA method [35]. 
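The failure of naive best-responding can be seen in a deliberately extreme toy instance, sketched below: two subcarriers whose gains alternate so that only one of them is usable in each period. This scenario, the parameter values and the helper names are assumptions chosen for illustration and are not taken from the paper's simulations; optimal allocations are found by a coarse grid search.

```python
import math

P_C, P_MAX = 1.0, 2.0
GRID = [P_MAX * i / 200 for i in range(201)]       # candidate per-carrier powers

def ee(p, g):
    """SISO energy efficiency over two subcarriers (gains g, powers p)."""
    rate = sum(math.log(1.0 + gk * pk) for gk, pk in zip(g, p))
    return rate / (P_C + sum(p))

def best_response(g):
    """EE-optimal allocation when exactly one subcarrier has a nonzero gain."""
    k = 0 if g[0] > 0 else 1
    def alloc(q):
        return [q if j == k else 0.0 for j in range(2)]
    return alloc(max(GRID, key=lambda q: ee(alloc(q), g)))

# Hypothetical alternating environment: only one subcarrier is usable per period.
states = [(10.0, 0.0), (0.0, 10.0)]
T = 20
gains = [states[n % 2] for n in range(T)]

# Policy 1: at stage n, play the optimum for the channel observed at stage n - 1.
br_total = sum(ee(best_response(gains[n - 1]), gains[n]) for n in range(1, T))

# Policy 2: the fixed average of the two per-state optima.
q1, q2 = best_response(states[0]), best_response(states[1])
fixed = [(a + b) / 2 for a, b in zip(q1, q2)]
fixed_total = sum(ee(fixed, gains[n]) for n in range(1, T))

print(br_total, fixed_total)
```

In this instance the best-responding user always loads the subcarrier that has just become unusable, so it collects zero energy efficiency, while the fixed average policy collects a strictly positive amount in every period.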

³For instance, it is well-known that internal regret minimization implies convergence to the set of correlated equilibria [42]. 

For simplicity, we first consider the case where the transmitter has access to perfect CSI and MUI measurements 
and we derive an anytime bound for the user’s regret; we then show that this bound can be tightened to 
$O(\log T)$ if the user’s effective channels always remain above a certain threshold and the algorithm’s step-size is 
chosen accordingly. The robustness of these guarantees in the presence of noise and uncertainty is then discussed in 
Section III-C. 


A. Energy efficiency maximization as an online concave problem 

The first difficulty in designing a no-regret policy for the online fractional program (OEE) is that the user’s energy 
efficiency function is not concave. This is perhaps most easily seen in the SISO case where the user’s energy efficiency 
objective becomes: 

\[
\mathrm{EE}(\mathbf{p}) = \frac{\sum_k \log(1 + g_k p_k)}{P_c + \sum_k p_k},
\tag{15}
\]

where $\mathbf{p} = (p_1, \dots, p_K)$ denotes the user’s power allocation vector and $g_k$ is the effective channel gain of channel $k$. 
Clearly, the fractional objective (15) is not concave with respect to any $p_k$; however, $\mathrm{EE}(\mathbf{p})$ can be recast as a concave 
function by employing the so-called Charnes-Cooper transformation [43] for turning fractional programs into concave 
ones.⁴ Specifically, if we set 


\[
x_0 = \left( P_c + \sum\nolimits_k p_k \right)^{-1}, \qquad \mathbf{x} = x_0 \cdot \mathbf{p},
\tag{16}
\]

we readily obtain $\mathrm{EE}(\mathbf{p}) = \sum_k x_0 \log(1 + g_k x_k / x_0)$, and this last function is concave because the summands $x_0 \log(1 + g_k x_k / x_0)$ 
are jointly concave in $x_0$ and $x_k$. We may then get rid of the parameter $x_0$ by noticing that $x_0 = P_c^{-1} (1 - \sum_k x_k)$, 
which restricts the energy efficiency in $(x_0, x_k)$ to an affine set on which it remains concave. Thus, by rewriting $\mathbf{x}$ as 
$\mathbf{x} = \mathbf{p} / (P_c + \sum_k p_k)$, solving for $\mathbf{p}$ and substituting in $\mathrm{EE}(\mathbf{p})$, we obtain a concave reformulation of (15). 
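The effect of the Charnes-Cooper change of variables in the SISO case can be checked numerically. The sketch below (with hypothetical gains and power vectors, and natural logarithms) verifies that the transformed objective takes the same value as (15) and passes a midpoint concavity test at two sample points; this is a spot check under these assumptions, not a proof of concavity.

```python
import math

P_C = 1.0   # circuit power, an arbitrary value for this example

def ee(p, g):
    """Fractional objective (15)."""
    return sum(math.log(1 + gk * pk) for gk, pk in zip(g, p)) / (P_C + sum(p))

def to_x(p):
    """Change of variables (16): x = p / (P_c + sum_k p_k)."""
    x0 = 1.0 / (P_C + sum(p))
    return [x0 * pk for pk in p]

def ee_x(x, g):
    """Transformed objective, with x0 eliminated via x0 = (1 - sum_k x_k) / P_c."""
    x0 = (1.0 - sum(x)) / P_C
    return sum(x0 * math.log(1 + gk * xk / x0) for gk, xk in zip(g, x))

g = [2.0, 0.5]                       # hypothetical effective gains
p_a, p_b = [0.1, 2.0], [3.0, 0.2]    # two hypothetical power allocations
x_a, x_b = to_x(p_a), to_x(p_b)
x_mid = [(a + b) / 2 for a, b in zip(x_a, x_b)]

print(ee(p_a, g), ee_x(x_a, g))      # same value in both parametrizations
print(ee_x(x_mid, g), 0.5 * (ee_x(x_a, g) + ee_x(x_b, g)))
```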

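In the general MIMO framework, this procedure amounts to the change of variables: 

\[
\mathbf{X} = \frac{P_c + P_{\max}}{P_{\max}} \, \frac{\mathbf{Q}}{P_c + \operatorname{tr}(\mathbf{Q})},
\tag{17}
\]

where we have introduced the normalization constant $(P_c + P_{\max})/P_{\max}$ in order to have $\operatorname{tr}(\mathbf{X}) \leq 1$ for all $\mathbf{Q} \in \mathcal{Q}$ (with 
equality if and only if $\operatorname{tr}(\mathbf{Q}) = P_{\max}$). Solving for $\mathbf{Q}$ then yields 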

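\[
\mathbf{Q} = \frac{P_c P_{\max}}{P_c + P_{\max}(1 - \operatorname{tr}(\mathbf{X}))} \, \mathbf{X},
\tag{18}
\]

so, after substituting in (8), we obtain the maximization objective 

\[
u(\mathbf{X}) = \mathrm{EE}(\mathbf{Q}) = \frac{P_c + P_{\max}(1 - \operatorname{tr}(\mathbf{X}))}{P_c (P_c + P_{\max})} \, \log\det\left( \mathbf{I} + \frac{P_c P_{\max}}{P_c + P_{\max}(1 - \operatorname{tr}(\mathbf{X}))} \, \tilde{\mathbf{H}} \mathbf{X} \tilde{\mathbf{H}}^\dagger \right),
\tag{19}
\]

while the corresponding feasible region of (9) attains the simple form: 

\[
\mathcal{X} = \left\{ \operatorname{diag}(\mathbf{X}_1, \dots, \mathbf{X}_K) : \mathbf{X}_k \succeq 0 \text{ and } \sum\nolimits_k \operatorname{tr}(\mathbf{X}_k) \leq 1 \right\}.
\tag{20}
\]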
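The matrix change of variables can be sanity-checked numerically. The following sketch implements (17) and (18) and the transformed objective (19) for a single randomly drawn effective channel, and compares (19) against a direct evaluation of (8). The function names, dimensions and parameter values are assumptions for illustration only.

```python
import numpy as np

def q_to_x(Q, P_c, P_max):
    """Change of variables (17)."""
    return (P_c + P_max) / P_max * Q / (P_c + np.trace(Q).real)

def x_to_q(X, P_c, P_max):
    """Inverse change of variables (18)."""
    return P_c * P_max / (P_c + P_max * (1.0 - np.trace(X).real)) * X

def u(X, Heff, P_c, P_max):
    """Transformed objective (19), in nats per unit power."""
    s = P_c + P_max * (1.0 - np.trace(X).real)
    G = np.eye(Heff.shape[0]) + (P_c * P_max / s) * Heff @ X @ Heff.conj().T
    return s / (P_c * (P_c + P_max)) * np.linalg.slogdet(G)[1]

# Hypothetical test data.
rng = np.random.default_rng(1)
P_c, P_max, M, N = 0.5, 2.0, 3, 4
Heff = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
B = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
Q = B @ B.conj().T
Q *= 0.5 * P_max / np.trace(Q).real        # tr(Q) = P_max / 2, so Q is feasible

X = q_to_x(Q, P_c, P_max)
G = np.eye(N) + Heff @ Q @ Heff.conj().T
ee_direct = np.linalg.slogdet(G)[1] / (P_c + np.trace(Q).real)   # objective (8)

print(np.trace(X).real)                    # at most 1 for feasible Q
print(u(X, Heff, P_c, P_max), ee_direct)   # should coincide
```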


⁴See also [12] for a similar use of the Charnes-Cooper transformation in the context of energy efficiency maximization. 

Given that $R(\mathbf{Q})$ is concave in $\mathbf{Q}$, the function $F(\mathbf{X}, x) = \frac{P_{\max}}{P_c + P_{\max}} \, x \cdot R(\mathbf{X}/x)$ will be jointly concave in $\mathbf{X}$ and $x$ [44], 
so $u(\mathbf{X})$ will also be concave in $\mathbf{X}$ as the restriction of $F(\mathbf{X}, x)$ to the convex set $P_c P_{\max} x = P_c + P_{\max}(1 - \operatorname{tr}(\mathbf{X}))$. In this 
way, (OEE) boils down to the online concave maximization problem: 

\[
\begin{aligned}
&\text{maximize} && u(\mathbf{X}; n), \\
&\text{subject to} && \mathbf{X} \in \mathcal{X},
\end{aligned}
\tag{21}
\]

where, as before, the dependence on $n = 1, 2, \dots$ reflects the evolution of the user’s effective channel matrices over 
time. Thus, in view of all this, we will first derive a no-regret transmit policy $\mathbf{X}(n)$ for the online concave problem (21) 
and we will then use the inverse transformation (18) to obtain a no-regret policy for (OEE). 


B. Learning with perfect CSI 

Building on Zinkevich’s online gradient ascent method [35], the core idea of our approach will be to track the 
gradient matrix $\mathbf{V} = \nabla u$ of the user’s (time-varying) utility function and then project back to the problem’s feasible 
region when the user’s power constraints are violated. To that end, some straightforward matrix calculus yields: 

\[
\mathbf{V} = \nabla u = \frac{P_{\max}}{P_c + P_{\max}} \left[ \mathbf{A} + \frac{\operatorname{tr}(\mathbf{A}\mathbf{Q}) - R(\mathbf{Q})}{P_c} \, \mathbf{I} \right],
\tag{22}
\]

where $\mathbf{Q}$ is calculated in terms of $\mathbf{X}$ via (18) and 

\[
\mathbf{A} = \nabla R(\mathbf{Q}) = \tilde{\mathbf{H}}^\dagger \left[ \mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger \right]^{-1} \tilde{\mathbf{H}}.
\tag{23}
\]

The above expression shows that $\mathbf{V}$ can be calculated at the transmitter as a function of the connection’s effective 
channel matrix $\tilde{\mathbf{H}}$ (which, in turn, can be estimated at the receiver end and then fed back to the transmitter via a 
dedicated backbone channel or as part of a TDD downlink subframe). Moreover, since $\mathbf{V}$ is a bounded function of $\tilde{\mathbf{H}}$ 
and the channel matrices $\tilde{\mathbf{H}}(n)$ are assumed bounded, the induced sequence of gradient matrices $\mathbf{V}(n) \equiv \nabla u(\mathbf{X}(n); n)$ 
will also be bounded. We will therefore assume that there exists a constant $V_0 > 0$ such that 

\[
\|\mathbf{V}(n)\| \leq V_0 \quad \text{for all } n = 1, 2, \dots,
\tag{24}
\]

where $\|\mathbf{V}\| = \operatorname{tr}(\mathbf{V}^\dagger \mathbf{V})^{1/2}$ denotes the Frobenius (matrix) norm of $\mathbf{V}$. 
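The gradient expressions (22) and (23) can be checked against a finite-difference approximation. The sketch below does this for a single random effective channel, comparing the directional derivative of the transformed objective along a random Hermitian direction with the inner product of the gradient matrix and that direction. All names, dimensions and parameter values are assumptions made for the example.

```python
import numpy as np

def x_to_q(X, P_c, P_max):
    """Inverse change of variables (18)."""
    return P_c * P_max / (P_c + P_max * (1.0 - np.trace(X).real)) * X

def rate(Q, Heff):
    """R(Q) = log det(I + Heff Q Heff^†), in nats."""
    G = np.eye(Heff.shape[0]) + Heff @ Q @ Heff.conj().T
    return np.linalg.slogdet(G)[1]

def utility(X, Heff, P_c, P_max):
    """u(X) = EE(Q), with Q obtained from X via (18)."""
    Q = x_to_q(X, P_c, P_max)
    return rate(Q, Heff) / (P_c + np.trace(Q).real)

def gradient(X, Heff, P_c, P_max):
    """Gradient matrix V from (22), with A from (23)."""
    Q = x_to_q(X, P_c, P_max)
    G = np.eye(Heff.shape[0]) + Heff @ Q @ Heff.conj().T
    A = Heff.conj().T @ np.linalg.solve(G, Heff)
    shift = (np.trace(A @ Q).real - rate(Q, Heff)) / P_c
    return P_max / (P_c + P_max) * (A + shift * np.eye(X.shape[0]))

# Hypothetical test point and Hermitian perturbation direction.
rng = np.random.default_rng(2)
P_c, P_max, M, N = 1.0, 2.0, 3, 4
Heff = rng.normal(size=(N, M)) + 1j * rng.normal(size=(N, M))
X = 0.1 * np.eye(M, dtype=complex)
B = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
D = B + B.conj().T

eps = 1e-6
fd = (utility(X + eps * D, Heff, P_c, P_max)
      - utility(X - eps * D, Heff, P_c, P_max)) / (2 * eps)
analytic = np.trace(gradient(X, Heff, P_c, P_max) @ D).real
print(fd, analytic)   # central difference vs. tr(V D)
```

Using `np.linalg.solve` instead of forming the inverse in (23) explicitly is a numerical convenience and does not change the result.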

In view of the above, and assuming for the moment perfect knowledge of $\mathbf{V}(n)$ at the transmitter, we will consider 
the matrix-valued online gradient ascent scheme: 

\[
\mathbf{X}(n+1) = \Pi\left( \mathbf{X}(n) + \gamma_n \mathbf{V}(n) \right),
\tag{OGA}
\]

where $\gamma_n > 0$ is a nonincreasing step-size sequence and $\Pi$ denotes the matrix projection map: 

\[
\Pi(\mathbf{Y}) = \arg\min_{\mathbf{X} \in \mathcal{X}} \|\mathbf{X} - \mathbf{Y}\|^2.
\tag{25}
\]

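As we show in Appendix A, the matrix projection $\Pi(\mathbf{Y})$ can be calculated by the simple expression: 

\[
\Pi(\mathbf{Y}) = \mathbf{U} \cdot \operatorname{diag}(\pi(\mathbf{y})) \cdot \mathbf{U}^\dagger,
\tag{26}
\]

where the tuple $(\mathbf{y}, \mathbf{U})$ diagonalizes $\mathbf{Y}$ (i.e. $\mathbf{Y} = \mathbf{U} \cdot \operatorname{diag}(\mathbf{y}) \cdot \mathbf{U}^\dagger$) and 

\[
\pi_i(\mathbf{y}) =
\begin{cases}
0 & \text{if } y_i \leq 0, \\
y_i & \text{if } y_i > 0 \text{ and } \sum_j [y_j]_+ \leq 1, \\
[y_i - \lambda]_+ & \text{if } y_i > 0 \text{ and } \sum_j [y_j]_+ > 1,
\end{cases}
\tag{27}
\]

with $\lambda > 0$ chosen so that $\sum_{i : y_i > 0} [y_i - \lambda]_+ = 1$.⁵

Fig. 1. Schematic representation of the recursive learning scheme (OGA). 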
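The projection step (26)-(27) can be sketched as follows. The threshold is found here with the standard sort-based procedure for projecting onto a simplex, which is one possible way to carry out the line search mentioned in the text; the function names and the variable `lam` are hypothetical.

```python
import numpy as np

def project_eigenvalues(y):
    """Eigenvalue map (27): clip at zero, then shift by lam if the sum exceeds 1."""
    y_plus = np.maximum(y, 0.0)
    if y_plus.sum() <= 1.0:
        return y_plus
    z = np.sort(y_plus)[::-1]                      # positive parts, descending
    shifted = (np.cumsum(z) - 1.0) / np.arange(1, z.size + 1)
    rho = np.nonzero(z > shifted)[0][-1]           # last index that stays active
    lam = shifted[rho]                             # sum_i [y_i - lam]_+ = 1
    return np.maximum(y - lam, 0.0)

def project(Y):
    """Matrix projection (26): diagonalize Y, map its eigenvalues, rebuild."""
    y, U = np.linalg.eigh(Y)
    return (U * project_eigenvalues(y)) @ U.conj().T

# Small worked example: eigenvalues (2, 0.5, -1) are mapped to (1, 0, 0).
Y = np.diag([2.0, 0.5, -1.0])
print(np.round(project(Y), 6))
```

In the worked example the positive parts sum to 2.5, so the shift branch applies with `lam = 1`, and only the largest eigenvalue survives the thresholding.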

The iterative process (OGA) will be the main focus of our paper, so we proceed with some remarks (see also Fig. 1 
for a schematic representation and Alg. 1 for a pseudocode version): 

a) Implementation properties: From a practical point of view, (OGA) has the following desirable properties: 

(P1) Distributedness: users require the same information as in distributed water-filling [45, 46]. 

(P2) Statelessness: users do not need to know the state of the system (e.g. the number of users in the network or its 
topology). 

(P3) Reinforcement: users tend to become more energy-efficient based on their past observations. 

(P4) Asynchronicity: the algorithm does not require a global update timer or any further signaling/coordination between 
users. 

b) Computational complexity: From a computational standpoint (which is crucial in massive MIMO systems), 
each iteration of Algorithm 1 requires a number of elementary binary operations which is polynomial (with a low 
degree) in the number of transmit/receive antennas and the number of subcarriers. Specifically, letting $S = \max\{M, N\}$ 
and recalling that $\tilde{\mathbf{H}}$ and $\mathbf{Q}$ consist of $K$ diagonal blocks, the required matrix multiplication and inversion steps for 
$\mathbf{A}$ and $\mathbf{V}$ carry a complexity of $O(K S^{\omega})$, with the complexity exponent $\omega$ being as low as 2.373 if fast 
Coppersmith-Winograd multiplication methods are employed [47]. As for the projection step $\mathbf{X} = \Pi(\mathbf{Y})$, Eqs. (26) and (27) 
show that it can also be carried out in $O(K S^{\omega})$ operations: the diagonalization in (26) involves $O(K S^{\omega})$ steps while 
(27) only requires $O(KM)$ operations for calculating the projection to the simplex [48]. 

With all this in mind, our main result for (OGA) is as follows: 

⁵Recall here that $\mathbf{Y}(n)$ is Hermitian (because $\mathbf{V}(n)$ is Hermitian for all $n$), so its eigenvalues are real. Just as in water-filling methods [45], the 
Lagrange multiplier $\lambda > 0$ can then be calculated by sorting $\mathbf{y}$ and performing a line search for $\lambda$. 





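Algorithm 1 Online gradient ascent (OGA) for dynamic energy efficiency maximization. 
Parameter: variable step-size sequence $\gamma_n > 0$. 

Initialize: $n \leftarrow 0$; $\mathbf{X} \leftarrow 0$. 

Repeat 

    $n \leftarrow n + 1$; 

    {Pre-transmission phase: set signal covariance matrix} 

    $\mathbf{Q} \leftarrow P_c P_{\max} / (P_c + P_{\max}(1 - \operatorname{tr} \mathbf{X})) \cdot \mathbf{X}$; 

    transmit; 

    {Post-transmission phase: receive feedback and update} 

    get $\tilde{\mathbf{H}}$; 

    $\mathbf{A} \leftarrow \tilde{\mathbf{H}}^\dagger [\mathbf{I} + \tilde{\mathbf{H}} \mathbf{Q} \tilde{\mathbf{H}}^\dagger]^{-1} \tilde{\mathbf{H}}$; 

    $\mathbf{V} \leftarrow P_{\max} / (P_c + P_{\max}) \cdot (\mathbf{A} + [(\operatorname{tr}(\mathbf{A}\mathbf{Q}) - R(\mathbf{Q})) / P_c] \cdot \mathbf{I})$; 

    $\mathbf{X} \leftarrow \Pi(\mathbf{X} + \gamma_n \mathbf{V})$; 

until transmission ends. 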
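For readers who wish to experiment with Algorithm 1, a minimal NumPy sketch is given below. It treats the full block-diagonal effective channel as a single matrix per frame, uses the step-size $\gamma_n = \gamma / \sqrt{n}$, and computes the projection with a sort-based threshold search; the function names, the static test channel and all parameter values are assumptions for illustration, not the simulation setup of Section IV.

```python
import numpy as np

def project(Y):
    """Projection (25)-(27) onto {X >= 0, tr(X) <= 1} via eigenvalue thresholding."""
    y, U = np.linalg.eigh(Y)
    x = np.maximum(y, 0.0)
    if x.sum() > 1.0:
        z = np.sort(x)[::-1]
        shifted = (np.cumsum(z) - 1.0) / np.arange(1, z.size + 1)
        rho = np.nonzero(z > shifted)[0][-1]
        x = np.maximum(y - shifted[rho], 0.0)
    return (U * x) @ U.conj().T

def oga(channels, P_c, P_max, gamma=1.0):
    """Run Algorithm 1 over a sequence of effective channel matrices."""
    dim = channels[0].shape[1]
    X = np.zeros((dim, dim), dtype=complex)        # agnostic initialization X = 0
    ee_history = []
    for n, H in enumerate(channels, start=1):
        # Pre-transmission phase: map X to the covariance matrix Q via (18).
        Q = P_c * P_max / (P_c + P_max * (1.0 - np.trace(X).real)) * X
        # Post-transmission phase: feedback H, then gradient (22)-(23).
        G = np.eye(H.shape[0]) + H @ Q @ H.conj().T
        rate = np.linalg.slogdet(G)[1]
        ee_history.append(rate / (P_c + np.trace(Q).real))
        A = H.conj().T @ np.linalg.solve(G, H)
        V = P_max / (P_c + P_max) * (
            A + (np.trace(A @ Q).real - rate) / P_c * np.eye(dim))
        # Gradient step with gamma_n = gamma / sqrt(n), followed by projection.
        X = project(X + gamma / np.sqrt(n) * V)
    return X, ee_history

# Hypothetical static channel with 3 transmit and 4 receive dimensions.
rng = np.random.default_rng(0)
H0 = (rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))) / np.sqrt(2)
channels = [H0 for _ in range(100)]
X_final, ee_history = oga(channels, P_c=1.0, P_max=2.0)
print(ee_history[0], ee_history[1], ee_history[-1])
```

With the initialization at zero, the first frame carries no power, so the recorded energy efficiency starts at zero and becomes positive from the second frame onwards.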


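Theorem 1. Assume that (OGA) is run with a variable step-size $\gamma_n$ such that $\gamma_n \to 0$ and $n\gamma_n \to \infty$. Then, the induced 
transmit policy $\mathbf{Q}(n)$ leads to no regret in the online energy efficiency maximization problem (OEE); specifically, 
(OGA) enjoys the cumulative regret bound: 

\[
\mathrm{Reg}(T) \leq \frac{1}{\gamma_T} + \frac{V_0^2}{2} \sum_{n=1}^{T} \gamma_n,
\tag{28}
\]

or, using a step-size sequence of the form $\gamma_n = \gamma n^{-1/2}$: 

\[
\mathrm{Reg}(T) \leq \frac{1 + \gamma^2 V_0^2}{\gamma} \sqrt{T}.
\tag{29}
\]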


Proof: See Appendix B. ■ 

The anytime regret bound (28) will be our core performance guarantee for Algorithm 1, so some remarks are in 
order: 

a) Fine-tuning $\gamma_n$: Theorem 1 shows that taking $\gamma_n \propto n^{-\alpha}$ for some $\alpha \in (0, 1)$ leads to a regret guarantee that is 
$O(T^{\max\{\alpha, 1 - \alpha\}})$;⁶ as such, (29) captures the optimal asymptotic behavior of the bound (28) for step-size 
sequences of the form $\gamma_n = \gamma / n^{\alpha}$. In fact, if $V_0$ can be estimated by the transmitter beforehand, the step-size parameter 
$\gamma$ can be fine-tuned further in order to minimize the coefficient of $\sqrt{T}$ in (29). Doing just that gives $\gamma = 1/V_0$ and 
provides the optimized bound: 

\[
\mathrm{Reg}(T) \leq 2 V_0 \sqrt{T}.
\tag{30}
\]
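For completeness, the step-size tuning behind (30) can be spelled out as a one-line calculation, writing $c(\gamma)$ (a symbol introduced here for convenience) for the coefficient of $\sqrt{T}$ in (29):

```latex
\[
c(\gamma) = \frac{1 + \gamma^2 V_0^2}{\gamma} = \frac{1}{\gamma} + \gamma V_0^2,
\qquad
c'(\gamma) = -\frac{1}{\gamma^2} + V_0^2 = 0
\;\Longrightarrow\;
\gamma = \frac{1}{V_0},
\qquad
c\!\left(\frac{1}{V_0}\right) = V_0 + V_0 = 2 V_0 .
\]
```

Since $c''(\gamma) = 2/\gamma^3 > 0$ for $\gamma > 0$, this stationary point is indeed the minimizer of the coefficient.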


Since $V_0$ is a bound on the Frobenius norm of the block-diagonal gradient matrices $\mathbf{V}(n)$, the guarantee (30) becomes 
$O(KM^2)$ so the algorithm’s overall regret will be at most linear in the number of subcarriers and quadratic in the 
number of antennas. This guarantee is key for massive MIMO systems (where the number of transmit/receive antennas 
can grow to be quite large) because it provides a worst-case estimate for the system’s equilibration time. That being 
said, (30) only becomes tight in adversarial environments (e.g. in the presence of jamming); in typical scenarios, the 
user’s regret usually decays much faster and the system attains a stable, no-regret state within a few iterations, even 
for large numbers of antennas per user - cf. the detailed discussion in Sec. IV. 

⁶To see this, simply note that $\sum_{n=1}^{T} n^{-\alpha} = O(T^{1 - \alpha})$ for large $T$ and $\alpha \in (0, 1)$. 

b) The static case: If the user’s effective channels remain static over time and Q(n) is a no-regret policy, a 
straightforward concavity argument can be used to show that $\max_n \mathrm{EE}(Q(n))$ converges to the solution of the (static) EE
maximization problem (9) [24, 49]. In this way, (OGA) can also be seen as a provably convergent low-cost algorithm 
for solving (9); furthermore, as we show in what follows, this convergence result continues to hold even in the presence 
of imperfect CSIT and measurement errors. 

c) Initialization: The agnostic initialization X(0) = 0 of Algorithm 1 means that the focal transmitter remains
effectively silent during the first transmission frame (recall that $Q \propto X$). As such, the first iteration of (OGA) can be
seen as a “handshake” that allows the transmitter to estimate his effective wireless medium before starting the bona
fide transmission of data frames. If the transmitter begins with a given belief regarding his effective channel conditions,
the algorithm can be initialized more aggressively in a manner consistent with the user’s initial expectations (setting
for instance $X = (KM)^{-1} I$ for uniform power allocation across subcarriers and antennas). In so doing, the regret bound
(28) can be tightened further but this only makes a significant difference if the transmission horizon T is very short (on
the order of a few frames).

d) Logarithmic regret under fair channel conditions: As stated, Theorem 1 provides a worst-case guarantee
which holds without any further caveats on the evolution of the channels from one stage to the next (other than basic 
information-theoretic hypotheses that allow the receiver to decode the transmitter’s signal). As such, another important 
question that arises is whether we can achieve stronger performance guarantees under the additional hypothesis that 
channel conditions do not become too bad. 

To quantify this, note first that the Shannon rate function $R(Q) = \log\det(I + HQH^{\dagger})$ is strongly concave in Q with
a strong concavity constant that is an increasing function of the singular values of H.† Accordingly, since the user’s
energy efficiency function EE(Q) can be expressed as a perspective transformation $R(Q) \mapsto x R(X/x)$, the same will
also hold for the strong concavity constant of $u(X)$ over $\mathcal{X}$ [33, 44]. On that account, if we assume that:

\[ \mathrm{Hess}(u(X; n)) \preceq -a I \quad \text{for some } a > 0 \text{ and for all } n = 1, 2, \dots,\ X \in \mathcal{X}, \tag{31} \]

we obtain the following stronger result:

Proposition 1. Assume that (OGA) is run with the step-size sequence $\gamma_n = \gamma/n$ for some $\gamma > a^{-1}$ with a as in (31).
Then, the induced transmit policy Q(n) enjoys the logarithmic regret bound:

\[ \mathrm{Reg}(T) \le \tfrac{1}{2} \gamma V_0^2 (1 + \log T). \tag{32} \]

Proof: See Appendix B. ■

†Recall here that a function f is strongly concave with constant c > 0 if $\mathrm{Hess}(f) \preceq -cI$.

Proposition 1 provides us with an important rule of thumb for choosing the step-size sequence of Algorithm 1. On
the one hand, if the user expects that his effective channel can become arbitrarily bad (e.g. due to network congestion or
deep fading events), the OGA algorithm should be run with a $\gamma_n \propto n^{-1/2}$ step-size sequence that allows higher adaptability
to strongly varying channel conditions. Otherwise, if the user expects reasonable channel quality over his transmission
horizon, the “softer” step-size choice $\gamma_n \propto n^{-1}$ minimizes the danger of overcompensating for transmit directions that
appear suboptimal and allows the user to converge to a no-regret state faster.
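The rule of thumb above can be sketched as a small step-size selector together with the two guarantees it trades off. The function names and the boolean flag `fair_channels` are hypothetical conveniences introduced for this example; the bounds are the right-hand sides of (30) and (32), and the latter is meaningful only when (31) holds with $\gamma > a^{-1}$.

```python
import math

def step_size(n, gamma, fair_channels):
    """Step-size gamma_n for stage n >= 1.

    fair_channels=False -> gamma / sqrt(n), the schedule behind (29)-(30);
    fair_channels=True  -> gamma / n, the schedule of Proposition 1.
    """
    if n < 1:
        raise ValueError("stage index must be at least 1")
    return gamma / n if fair_channels else gamma / math.sqrt(n)

def sqrt_bound(v0, horizon):
    """Right-hand side of (30), obtained with gamma = 1 / V0."""
    return 2.0 * v0 * math.sqrt(horizon)

def log_bound(gamma, v0, horizon):
    """Right-hand side of (32); requires (31) with gamma > 1/a."""
    return 0.5 * gamma * v0 ** 2 * (1.0 + math.log(horizon))

v0, gamma = 1.0, 2.0  # illustrative values only
for horizon in (10, 100, 1000):
    print(horizon, round(sqrt_bound(v0, horizon), 2), round(log_bound(gamma, v0, horizon), 2))
```

For long horizons the logarithmic guarantee grows much more slowly, which is the quantitative content of the "softer" step-size recommendation.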


C. Learning under uncertainty 

A key assumption in our analysis so far is that the transmitter has access to perfect CSI and MUI measurements with 
which to calculate the gradient matrices V(n) at each stage; in practice however, factors such as pilot contamination, 
sparse feedback and imperfect channel sampling could have a deleterious effect on the algorithm’s performance. As 
such, our goal in this section will be to analyze (OGA) in the presence of uncertainty and measurement errors. 

To formalize this, we will assume that, at each stage n = 1, 2, ..., the transmitter observes a noisy estimate $\hat{V}(n)$ of
V(n) satisfying the statistical hypotheses:

(H1) Unbiasedness:

\[ \mathbb{E}[\hat{V}(n) - V(n) \mid Q(n-1)] = 0. \tag{H1} \]

(H2) Tame error tails:

\[ \mathbb{P}(\|\hat{V}(n) - V(n)\| \ge z) \le B/z^{\beta} \quad \text{for some } B > 0 \text{ and for some } \beta > 2. \tag{H2} \]

Clearly, both hypotheses are quite mild from a practical point of view. First, the unbiasedness hypothesis (H1) simply
amounts to asking that there is no systematic error in the user’s measurements. Likewise, Hypothesis (H2) is a bare-bones
assumption on the probability of observing very high errors and is satisfied by the vast majority of statistical
error distributions (including uniform, Gaussian, log-normal, and all Lévy-type error processes); in particular, we will
not be assuming that the error process $Z(n) = \hat{V}(n) - V(n)$ is i.i.d., state-independent, or even a.s. bounded.
Remarkably, under these minimal hypotheses, we have:


Theorem 2. Assume that (OGA) is run with noisy measurements $\hat{V}(n)$ satisfying Hypotheses (H1) and (H2), and with
a variable step-size sequence of the form $\gamma_n = \gamma/n^{\alpha}$ for some $\alpha \in (2/\beta, 1)$. Then, the induced transmit policy Q(n)
leads to no regret (a.s.) and enjoys the mean regret bound:

\[ \mathbb{E}[\mathrm{Reg}(T)] \le \frac{1}{\gamma_T} + \frac{\hat{V}_0^2}{2} \sum_{n=1}^{T} \gamma_n, \tag{33} \]

where $\hat{V}_0^2 = \sup_n \mathbb{E}[\|\hat{V}(n)\|^2]$.
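To make Hypotheses (H1) and (H2) concrete, the sketch below draws heavy-tailed measurement errors and checks both properties empirically. It is a scalar toy model: the symmetric Pareto error law, the value of `true_gradient`, and the sample size are assumptions chosen for illustration and are not the error model of the paper.

```python
import random

def heavy_tailed_error(rng, beta):
    """Zero-mean error with P(|Z| >= z) = z**(-beta) for z >= 1 (symmetric Pareto)."""
    magnitude = rng.paretovariate(beta)
    return magnitude if rng.random() < 0.5 else -magnitude

rng = random.Random(0)
beta = 3.0            # tail exponent; (H2) then holds with B = 1
true_gradient = 1.5   # scalar stand-in for V(n)
samples = [true_gradient + heavy_tailed_error(rng, beta) for _ in range(20000)]

# (H1): the empirical mean of the error should be close to zero
mean_error = sum(samples) / len(samples) - true_gradient
# (H2): the empirical tail at z = 2 should be close to 2**(-beta) = 0.125
tail_fraction = sum(abs(s - true_gradient) >= 2.0 for s in samples) / len(samples)

print(f"mean error: {mean_error:+.4f}")
print(f"P(|Z| >= 2): {tail_fraction:.4f}")
```

With $\beta = 3$ the errors are unbounded and far from Gaussian, yet any $\alpha \in (2/3, 1)$ satisfies the step-size requirement of Theorem 2.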


Theorem 2 (proven in Appendix C) will be our main result in the context of dynamic energy efficiency maximization
under imperfect CSI, so a few remarks are in order:
a) Step-sizes vs. large error probabilities: The requirement $\alpha\beta > 2$ of Theorem 2 indicates a trade-off between
the probability of observing very large errors and achieving low regret (33). Specifically, if the error distribution of
$Z(n) = \hat{V}(n) - V(n)$ has very heavy tails (i.e. (H2) does not hold for $\beta \gg 2$), Algorithm 1 must be bootstrapped with a
conservative step-size sequence $\gamma_n \propto 1/n^{\alpha}$ for some $\alpha \approx 1$; in so doing however, the first term of (33) becomes almost
linear, so the user might experience relatively high regret on average (due to the high probability of observing very
large errors). On the other hand, if the tails of $\hat{V}(n)$ are lighter (for instance, the standard case of normally distributed
errors exhibits exponentially thin tails, so (H2) holds for all $\beta$), Algorithm 1 can be employed with a more adaptive
step-size sequence that guarantees a lower regret bound.

In particular, if (H2) holds for some $\beta > 4$, (OGA) can be used with a step-size sequence of the form $\gamma_n = \gamma n^{-1/2}$
which achieves the optimal behavior of (33), viz.


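\[ \mathbb{E}[\mathrm{Reg}(T)] \le \frac{1 + \gamma^2 \hat{V}_0^2}{\gamma} \sqrt{T}. \tag{34} \]

Thus, if the mean square bound $\hat{V}_0^2 = \sup_n \mathbb{E}[\|\hat{V}(n)\|^2]$ can be estimated ahead of time,† the step-size sequence $\gamma_n$ can
be optimized further. More precisely, working as in the deterministic case, the coefficient of $\sqrt{T}$ in (34) is minimized
when $\gamma = 1/\hat{V}_0$, so we obtain the optimized bound:

\[ \mathbb{E}[\mathrm{Reg}(T)] \le 2 \hat{V}_0 \sqrt{T}. \tag{35} \]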


b) The estimation process: The no-regret properties of (OGA) under uncertainty rely on the availability of
statistically unbiased measurements $\hat{V}$ of V. In turn, given that users have perfect knowledge of their individual
transmit covariance matrices, this requirement boils down to constructing an unbiased estimator of the matrix
$A = H^{\dagger}(I + HQH^{\dagger})^{-1}H$. In our Gaussian context, this can be accomplished via the statistical sampling process of [7, 8]
which provides an unbiased estimator of A with exponentially decaying error tails (i.e. (H2) holds for all $\beta > 2$).
However, due to space limitations we will not address this question in more detail here.

c) Fair channel conditions and noise: As before, the regret guarantee (33) can be tightened significantly if the
user’s effective channel conditions satisfy (31). In that case, running (OGA) with step-sizes $\gamma_n \propto 1/n$, we obtain the
following stochastic analogue of Proposition 1:


Proposition 2. With notation as in Theorem 2, assume that (OGA) is run with noisy measurements and a variable
step-size sequence $\gamma_n = \gamma/n$ for some $\gamma > a^{-1}$ with a defined as in (31). Then, the induced transmit policy Q(n) leads
to no regret (a.s.) and enjoys the mean guarantee:

\[ \mathbb{E}[\mathrm{Reg}(T)] \le \tfrac{1}{2} \gamma \hat{V}_0^2 (1 + \log T). \tag{36} \]

Proof: See Appendix C. ■ 

As in the perfect CSI case, Proposition 2 provides a rule of thumb for achieving lower regret faster when the user’s
(effective) wireless medium is not too bad: as long as (31) holds for some a > 0, the user can achieve logarithmic
regret, even with very noisy measurements.

†Note here that $\hat{V}_0$ is guaranteed to be finite on account of Hypothesis (H2).
TABLE I
WIRELESS NETWORK SIMULATION PARAMETERS

Parameter                       Value             Parameter                    Value
Number of cells                 19 (hexagonal)    Cell radius                  1 km
User density                    500 users/km²     Time frame duration          5 ms
Propagation model               COST Hata         BS/MS antenna height         32 m / 1.5 m
Central frequency               2.5 GHz           Total bandwidth              11.2 MHz
OFDM subcarriers                1024              Subcarrier spacing           11 kHz
Spectral noise density (20 °C)  -174 dBm/Hz       User speed                   [3, 130] km/h
Maximum transmit power          Pmax = 33 dBm     Non-radiative power          Pc = 20 dBm
Transmit antennas per device    M = 4             Receive antennas per link    N = 8

IV. Numerical Results 

In this section, we assess the practical performance aspects of the OGA algorithm via numerical simulations. For 
presentational clarity, we only present here a representative subset of these results but our conclusions apply to a wide 
range of wireless network parameters and specifications. 

Our setup is as follows: we consider a cellular OFDMA wireless network occupying a 10 MHz band divided into
1024 subcarriers around a central frequency of $f_c$ = 2.5 GHz. Wireless signal propagation is modeled following
the well-known COST Hata model [50, 51] and the spectral noise density is taken to be -174 dBm/Hz at 20 °C
(for a detailed overview of simulation parameters, see Table I). Network coverage is provided by 19 hexagonal cells
(each with a radius of 1 km) that form a honeycomb pattern spanning an urban area with wireless user density
$\rho$ = 500 users/km². To minimize complexity, OFDM subcarriers are allocated to wireless users in each cell following a
simple randomized access scheme that assigns different users to disjoint subcarrier sets [52]; as such, the main sources
of CCI are connections in neighboring cells that utilize the same subcarriers.

To model this, we focus on a set of U = 15 transmitting users that are located in different cells (following a Poisson
point process sampling) and share K = 8 common subcarriers. Each wireless transmitter is further assumed to have
M = 4 transmit antennas, a maximum transmit power of $P_{\max}$ = 40 dBm and circuit (non-radiative) power consumption
of $P_c$ = 20 dBm; at the receiver end, we consider N = 8 receive antennas per connection and a receiver noise figure of
7 dB. Finally, communication occurs over a time-division duplexing (TDD) scheme with frame duration $T_f$ = 5 ms:
transmission takes place during the uplink (UL) subframe while the receivers process the received signal and provide
feedback during the downlink (DL) subframe; upon reception of the feedback, the users update their power profiles
following Alg. 1 and the process repeats.
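As a sanity check on the link-budget figures, the sketch below derives the subcarrier spacing and the per-subcarrier noise floor from the values listed in Table I and the 7 dB noise figure quoted above. The helper names are hypothetical, and adding the noise figure in dB to the thermal noise is the usual textbook convention, assumed here rather than stated in the paper.

```python
import math

def dbm_to_watts(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** (power_dbm / 10) / 1000

def subcarrier_noise_dbm(bandwidth_hz, num_subcarriers, noise_density_dbm_hz, noise_figure_db):
    """Noise power per subcarrier in dBm, including the receiver noise figure."""
    spacing_hz = bandwidth_hz / num_subcarriers
    return noise_density_dbm_hz + 10 * math.log10(spacing_hz) + noise_figure_db

spacing_hz = 11.2e6 / 1024                  # Table I: 11.2 MHz over 1024 subcarriers
noise_dbm = subcarrier_noise_dbm(11.2e6, 1024, -174.0, 7.0)
circuit_power_w = dbm_to_watts(20.0)        # P_c = 20 dBm

print(f"subcarrier spacing: {spacing_hz / 1e3:.2f} kHz")
print(f"noise per subcarrier: {noise_dbm:.1f} dBm")
print(f"circuit power: {circuit_power_w:.3f} W")
```

The computed spacing rounds to the 11 kHz entry of Table I.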

A. Static channels 

For benchmarking purposes, our first simulation scenario addresses the case of stationary users with static channel
conditions (so the variability of a user’s effective channel matrix is only due to the modulation of the interfering
users’ transmit characteristics). Each user is assumed to run (OGA) with a variable step-size of the form $\gamma_n \propto 1/\sqrt{n}$
and an agnostic initialization with initial transmit power $P_0 = P_{\max}/2$ = 26 dBm spread evenly across antennas and
subcarriers. Our simulation results are presented in Fig. 2 where, to minimize graphical clutter, we only plot the
relevant information for 4 users with diverse channel characteristics.

Fig. 2. Performance of Algorithm 1 under static channel conditions: (a) transmit power evolution under OGA; (b) spectral efficiency evolution under OGA; (c) energy efficiency maximization under OGA; (d) regret minimization under OGA. The system converges within a few iterations to an equilibrium state
(Fig. 2(c)) where users experience no regret (Fig. 2(d)).

First, in Fig. 2(a), we plot the users’ transmit power under (OGA). As can be seen, even though users change their
power by several dB, the algorithm quickly equilibrates after an initial transient phase. Similarly, in Fig. 2(b), we
plot the users’ transmit rate over all subcarriers (normalized by the bandwidth and thus measured in bps/Hz). We
see here that users who reduce power by more than 10 dB (Users 3 and 4) experience a commensurate drop in
spectral efficiency (of the order of a few bps/Hz); on the other hand, users that decrease power only by a little achieve
higher rates because the OGA algorithm leads to a more efficient allocation of power over subcarriers and antennas.
Nonetheless, in all cases we observe a dramatic increase in energy efficiency over the users’ initial (uniform) power
allocation policy, ranging from ≈ 200% to more than 600%.†

In fact, as we see in Fig. 2(c), after some slight oscillations during the first few iterations (the algorithm’s transient
phase), the system rapidly equilibrates and reaches a state where users have no incentive to change their individual
power profiles (a Nash equilibrium). This equilibration is consistent with the no-regret properties of the OGA algorithm;
as predicted by Theorem 1 and shown in Fig. 2(d), the users’ regret quickly decays to zero even though the
algorithm’s agnostic (and, in hindsight, suboptimal) initialization leads to high regret in the first few iterations.

†Contrary to Fig. 2(b), we do not normalize the users’ energy efficiency by the bandwidth, so it is measured in Mb/J.

B. Time-varying channels and mobility 

To account for dynamic network conditions, we also consider in Fig. 3 the case of mobile users whose channels vary
with time due to Rayleigh fading, path loss fluctuations, etc. For simulation purposes, we used the extended typical
urban (ETU) model for the users’ environment and the extended pedestrian A (EPA) and extended vehicular A (EVA)
models to simulate pedestrian (3-5 km/h) and vehicular (30-130 km/h) movement respectively [53]; for reference,
the focal users’ channel gains ($\operatorname{tr}(HH^{\dagger})$) have been plotted in Fig. 3(a). Despite the channels’ variability, Fig. 3(b)
shows that the users attain a no-regret state in a few iterations, even under rapidly changing channel conditions (cf.
the case of Users 2 and 4 with an average speed of 30 km/h and 130 km/h respectively). For completeness, we also
plot in Figs. 3(c) and 3(d) the achieved energy efficiency for a pedestrian and a vehicular user, and we compare it
to its instantaneous maximum value, the users’ initial (uniform) power allocation policy, and the “oracle” solution
which corresponds to the best fixed transmit profile in hindsight (i.e. the solution of the offline maximization problem
(12) which posits that users can predict the system’s evolution in advance). Even under rapidly changing
channel conditions, the users’ achieved energy efficiency tracks its (evolving) maximum value remarkably well and
consistently outperforms even the oracle solution (a fact which is consistent with the negative regret observed in
Fig. 3(b)).

An intuitive explanation for the adaptability of OGA is provided by Figs. 3(e) and 3(f) where we plot the transmit 
power of the optimum policy, the OGA scheme, and the oracle solution for the same users as in Figs. 3(c) and 3(d). 
Even though the optimum covariance matrix Q*(n) may change significantly from one frame to the next, tr(Q*(n)) 
remains roughly constant (within a few dB) over the entire transmission horizon. The OGA algorithm then learns
this power level in a few iterations and stays close to it throughout the transmission horizon; as a result, the users’ 
achieved energy efficiency remains itself very close to its maximum value for all time. 

C. Robustness to observation noise and scalability for large antenna numbers 

Finally, to assess the robustness of the OGA algorithm in the presence of observation noise and measurement
errors, the simulation cycle above was repeated in Fig. 4 for the case where users only have access to noisy gradient
observations as in Section III-C. Also, to study the algorithm’s scalability in the massive MIMO regime (large number
of antennas), we increased the number of transmit antennas to M = 8 and the number of receive antennas to N = 128;
otherwise, for comparison purposes, we used the same network simulation parameters as in Fig. 2.

The intensity of the measurement noise was quantified via the relative error level of the estimator $\hat{V}$, i.e. its standard
deviation over its mean (so a relative error level of $\eta$% means that the observed matrix $\hat{V}$ lies within $\eta$% of its true
value). We then plotted the users’ achieved energy efficiency under the OGA algorithm for noise levels $\eta$ = 20%
(moderate-to-high uncertainty) and $\eta$ = 100% (very high uncertainty). As can be seen in Fig. 4, the system’s rate of
equilibration is adversely affected by the intensity of the noise; however, the system still equilibrates within a few tens
of iterations and the users exhibit a drastic increase in energy efficiency (of the order of 150% and higher), even in the
presence of very high measurement noise.

Fig. 3. Performance of the OGA algorithm in a dynamic setting with mobile users moving at v = {3, 30, 5, 130} km/h: (a) channel gain evolution for different user velocities; (b) user regret under OGA; (c) energy efficiency under OGA (pedestrian, v = 5 km/h); (d) energy efficiency under mobility (v = 30 km/h); (e) transmit power evolution under OGA (pedestrian); (f) transmit power evolution under OGA (vehicular). The users’ achieved energy
efficiency tracks its (evolving) maximum value remarkably well, even under rapidly changing channel conditions.

Fig. 4. Performance of Algorithm 1 with imperfect measurements and observation errors (energy efficiency versus iterations for 20% and 100% relative error). Remarkably, even under very high uncertainty, the system
converges within a few tens of iterations to a stable state (dashed lines) where users experience no regret.


V. Conclusions and Perspectives 

In the context of multi-carrier, massive MIMO systems where numerous interfering mobile users co-exist, the 
temporal variability of the system cannot be ignored when targeting highly energy-efficient communications. To 
tackle these issues, we introduced an online semidefinite optimization framework for the study of energy efficiency 
maximization in dynamically varying networks, and we proposed an adaptive transmit policy that allows users to attain 
a “no-regret” state. Importantly, the proposed policy is distributed, asynchronous, computationally simple, and it only 
requires minimal, strictly causal and (potentially) noisy channel state information at the transmitter. Specifically, under 
very mild assumptions for the statistics of the error process, we showed that the users’ average regret after T epochs
decays as $O(T^{-1/2})$, a bound which is further improved to $O(\log T / T)$ if the users do not experience arbitrarily bad
channel conditions. As a result, users are able to track their most efficient transmit power profile with modest feedback
requirements, even under rapidly changing channel conditions (corresponding to highly mobile users): specifically, our 
simulations show that users could gain up to 600% in energy efficiency over fixed/uniform power allocation policies 
in realistic network environments. 

An important theoretical question which arises is whether the system converges to an equilibrium state if all users 
employ a no-regret policy (our numerical simulations show that this is indeed the case over a wide region of system
parameters). Additionally, different throughput-per-power models accounting for the probability of outage can also be 
considered and would require a modification of the proposed transmit policy. We intend to explore these directions in 
future work. 


Appendix 
Technical Proofs 

Throughout this appendix (and unless explicitly mentioned otherwise), all matrices are assumed to be Hermitian and
of dimension D = KM. Additionally, the stage variable n will be written as a subscript instead of as a parenthetical
argument - i.e. we will write $X_n$ and $u_n(X)$ instead of X(n) and u(X; n) respectively. We do so in order to reduce the
notational clutter caused by an overflow of parentheses; since we will not require a subcarrier index, there is no fear
of ambiguity.


A. Matrix projections 

We first prove that the projection map $\Pi(Y)$ is given by the explicit formula (26). To that end, simply note that $\Pi(Y)$
can be expressed equivalently as the solution to the maximization problem

\[ \begin{aligned} &\text{maximize} && \operatorname{tr}(YX) - \tfrac{1}{2}\|X\|^2 \\ &\text{subject to} && X \in \mathcal{X}. \end{aligned} \tag{37} \]

However, if $Y = U \Lambda U^{\dagger}$ is a diagonalization of Y, the objective of (37) can be written as:

\[ \operatorname{tr}(YX) - \tfrac{1}{2}\|X\|^2 = \operatorname{tr}(\Lambda U^{\dagger} X U) - \tfrac{1}{2} \operatorname{tr}(U^{\dagger} X U U^{\dagger} X U). \tag{38} \]

Thus, given that $X \in \mathcal{X}$ if and only if $U^{\dagger} X U \in \mathcal{X}$, we readily get $\Pi(Y) = U \Pi(\Lambda) U^{\dagger}$ so it suffices to solve (37) for
diagonal Y.

We first show that $\Pi(Y)$ is itself diagonal if Y = diag(y) for some $y \in \mathbb{R}^D$. Indeed, we have:

\[ \operatorname{tr}(YX) - \tfrac{1}{2}\|X\|^2 = \sum_j y_j X_{jj} - \tfrac{1}{2} \sum_{j,k} |X_{jk}|^2 \le \sum_j y_j X_{jj} - \tfrac{1}{2} \sum_j X_{jj}^2, \tag{39} \]

with equality if and only if X is diagonal. As a result, if X is a solution of (37), the diagonal matrix X' which coincides
with X on the diagonal and has zero entries otherwise will also be a solution of (37); since (37) admits a unique
solution, we conclude that $\Pi(Y)$ must also be diagonal, as claimed. We are thus left to solve the maximization problem

\[ \begin{aligned} &\text{maximize} && \sum_j y_j x_j - \tfrac{1}{2} \sum_j x_j^2 \\ &\text{subject to} && x_j \ge 0, \quad \sum_j x_j \le 1. \end{aligned} \tag{40} \]

Writing $\lambda_j \ge 0$ and $\lambda \ge 0$ for the Lagrange multipliers of the constraints $x_j \ge 0$ and $\sum_j x_j \le 1$ respectively, the
first-order Karush-Kuhn-Tucker (KKT) conditions for (40) become:

\[ y_j = x_j + \lambda - \lambda_j, \tag{41a} \]

\[ \lambda_j x_j = 0, \qquad \lambda \Big( 1 - \sum_j x_j \Big) = 0. \tag{41b} \]

Thus, to obtain the first branch of (27), simply note that if $y_j < 0$ but $x_j > 0$, we will also have $\lambda_j = 0$, so (41a) gives
$y_j = x_j + \lambda > 0$, a contradiction. Likewise, if $\sum_{j: y_j > 0} y_j \le 1$, setting $\lambda_j = \lambda = 0$ and $x_j = y_j$ for all j such that $y_j > 0$ is
obviously a solution of (41), so we obtain the second branch of (27). Finally, to obtain the third branch of (27), note
first that $\sum_{j: y_j > 0} x_j = 1$ if $\sum_{j: y_j > 0} y_j > 1$; otherwise, we would have $\lambda = 0$ and (41a) would give $y_j = x_j - \lambda_j \le x_j$
whenever $y_j > 0$, implying in turn that $\sum_{j: y_j > 0} y_j \le \sum_{j: y_j > 0} x_j < 1$, a contradiction. Accordingly, we are left to project
the vector with components $y_j^{+} = [y_j]_+$ to the unit simplex $\Delta = \{x \in \mathbb{R}^D : x_j \ge 0 \text{ and } \sum_j x_j = 1\}$; this projection simply
gives $x_j = [y_j^{+} - \lambda]_+$ with $\lambda > 0$ such that $\sum_j [y_j^{+} - \lambda]_+ = 1$ [48], so (26) follows.
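The three branches derived above translate directly into a short routine acting on the eigenvalue vector y. The sketch below covers only this diagonal step: the eigendecomposition $Y = U \Lambda U^{\dagger}$ and the reassembly $U \Pi(\Lambda) U^{\dagger}$ are left out, and the sort-based search for $\lambda$ is one standard way of computing the simplex projection, assumed here rather than taken from the paper. The function name is hypothetical.

```python
def project_eigenvalues(y):
    """Euclidean projection of a real vector y onto {x : x_j >= 0, sum_j x_j <= 1}."""
    # first branch: negative components are set to zero
    clipped = [max(v, 0.0) for v in y]
    # second branch: if clipping alone is feasible, it is the projection
    if sum(clipped) <= 1.0:
        return clipped
    # third branch: find lam > 0 such that sum_j [y_j - lam]_+ = 1
    running_sum, lam = 0.0, 0.0
    for k, value in enumerate(sorted(clipped, reverse=True), start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / k
        if value > candidate:      # the k largest entries stay positive
            lam = candidate
    return [max(v - lam, 0.0) for v in clipped]

print(project_eigenvalues([0.5, -0.2, 0.3]))   # second branch
print(project_eigenvalues([0.9, 0.6, -1.0]))   # third branch
```

In the third branch the output sums to one, so the transmitter uses its full power budget, in line with the KKT argument.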
B. No regret with perfect CSI

The key step in bounding the user’s regret is the inequality:

\[ \mathrm{EE}(Q^*) - \mathrm{EE}(Q) = u(X^*) - u(X) \le \operatorname{tr}[V \cdot (X^* - X)], \tag{42} \]

which is a simple consequence of the fact that u(X) is concave in X. Our proof follows the methodology of [35]
where OGA methods were used in a vector (as opposed to matrix) setting with the special step-size sequence $\gamma_n \propto n^{-1/2}$
(as opposed to general $\gamma_n$). To be precise, we will establish the no-regret properties of (OGA) by showing that
$\sum_{n=1}^{T} \operatorname{tr}[V_n \cdot (X^* - X_n)] = o(T)$ for all $X^* \in \mathcal{X}$ and for every matrix sequence $V_n$.

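Proof of Theorem 1: Letting $D_n = \tfrac{1}{2} \|X^* - X_n\|^2$, we get:

\[ D_{n+1} = \tfrac{1}{2} \|X^* - X_{n+1}\|^2 = \tfrac{1}{2} \|X^* - \Pi(X_n + \gamma_n V_n)\|^2 \le \tfrac{1}{2} \|X^* - X_n - \gamma_n V_n\|^2 \tag{43} \]

on account of the definition of $\Pi(Y)$ as the closest point to Y on $\mathcal{X}$. In this way, (43) yields:

\[ D_{n+1} \le D_n - \gamma_n \operatorname{tr}[V_n \cdot (X^* - X_n)] + \tfrac{1}{2} \gamma_n^2 \|V_n\|^2, \tag{44} \]

and hence, after rearranging and summing over n, we obtain:

\[ \begin{aligned} \sum_{n=1}^{T} \operatorname{tr}[V_n \cdot (X^* - X_n)] &\le \sum_{n=1}^{T} \gamma_n^{-1} (D_n - D_{n+1}) + \frac{1}{2} \sum_{n=1}^{T} \gamma_n \|V_n\|^2 \\ &\le \gamma_1^{-1} D_1 + \sum_{n=2}^{T} \left( \gamma_n^{-1} - \gamma_{n-1}^{-1} \right) D_n + \frac{1}{2} \sum_{n=1}^{T} \gamma_n \|V_n\|^2 \end{aligned} \tag{45} \]

\[ \le \gamma_1^{-1} \Omega + \sum_{n=2}^{T} \left( \gamma_n^{-1} - \gamma_{n-1}^{-1} \right) \Omega + \frac{V_0^2}{2} \sum_{n=1}^{T} \gamma_n = \frac{1}{\gamma_T} + \frac{V_0^2}{2} \sum_{n=1}^{T} \gamma_n, \tag{46} \]

where $\Omega = \tfrac{1}{2} \max_{X, X' \in \mathcal{X}} \|X - X'\|^2 = 1$. The fact that $X_n$ leads to no regret then follows by noting that $1/(T\gamma_T) \to 0$
(by assumption) and that $T^{-1} \sum_{n=1}^{T} \gamma_n \to 0$ (since $\gamma_n \to 0$). ■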
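The argument above can be exercised on a toy instance. The sketch below runs the recursion $X_{n+1} = \Pi(X_n + \gamma_n V_n)$ for diagonal matrices (represented as vectors) with linear payoffs $u_n(x) = \langle v_n, x \rangle$, then compares the realized regret with the right-hand side of (46). The gradient sequence, the dimension, and all function names are assumptions made for illustration; the payoffs are not the energy efficiency function of the paper.

```python
import math

def project(y):
    """Euclidean projection of a vector onto {x : x_j >= 0, sum_j x_j <= 1}."""
    clipped = [max(v, 0.0) for v in y]
    if sum(clipped) <= 1.0:
        return clipped
    running_sum, lam = 0.0, 0.0
    for k, value in enumerate(sorted(clipped, reverse=True), start=1):
        running_sum += value
        candidate = (running_sum - 1.0) / k
        if value > candidate:
            lam = candidate
    return [max(v - lam, 0.0) for v in clipped]

def oga_payoff(gradients, gamma):
    """Total payoff of x_{n+1} = project(x_n + gamma_n v_n), gamma_n = gamma / sqrt(n)."""
    x = [0.0] * len(gradients[0])              # agnostic initialization X(0) = 0
    total = 0.0
    for n, v in enumerate(gradients, start=1):
        total += sum(vj * xj for vj, xj in zip(v, x))   # payoff <v_n, x_n>
        step = gamma / math.sqrt(n)
        x = project([xj + step * vj for xj, vj in zip(x, v)])
    return total

horizon = 200
gradients = [[-1.0, 2.0] if n % 3 == 0 else [1.0, 0.5] for n in range(horizon)]
v0 = max(math.sqrt(sum(c * c for c in v)) for v in gradients)
gamma = 1.0 / v0

# for linear payoffs the best fixed point in hindsight is 0 or a vertex of the feasible set
column_sums = [sum(v[j] for v in gradients) for j in range(2)]
best_fixed = max(0.0, max(column_sums))
regret = best_fixed - oga_payoff(gradients, gamma)
bound = math.sqrt(horizon) / gamma + 0.5 * v0 ** 2 * sum(
    gamma / math.sqrt(n) for n in range(1, horizon + 1))
print(f"regret = {regret:.2f}, bound (46) = {bound:.2f}")
```

Because the step-sizes are decreasing and $\|v_n\| \le V_0$ by construction, the printed regret cannot exceed the printed bound.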

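Proof of Proposition 1: We note first that the a-strong concavity assumption (31) for u gives:

\[ u_n(X^*) - u_n(X_n) \le \operatorname{tr}[V_n (X^* - X_n)] - \frac{1}{2\gamma} \|X^* - X_n\|^2, \]

where we have used the fact that $a > \gamma^{-1}$. Thus, by summing over n and using (45), we obtain:

\[ \begin{aligned} \mathrm{Reg}(T) &\le \frac{1}{2} \sum_{n=2}^{T} \left( \gamma_n^{-1} - \gamma_{n-1}^{-1} - \gamma^{-1} \right) \|X^* - X_n\|^2 + \frac{1}{2} \sum_{n=1}^{T} \gamma_n \|V_n\|^2 \\ &\le \frac{1}{2} \gamma V_0^2 \sum_{n=1}^{T} \frac{1}{n} \le \frac{1}{2} \gamma V_0^2 (1 + \log T), \end{aligned} \tag{47} \]

where, in the second line, we used the fact that $\gamma_n^{-1} - \gamma_{n-1}^{-1} = n\gamma^{-1} - (n-1)\gamma^{-1} = \gamma^{-1}$.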
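The logarithmic factor in the last step comes from the standard integral comparison for the harmonic sum, spelled out here for convenience:

```latex
\sum_{n=1}^{T} \gamma_n
= \gamma \sum_{n=1}^{T} \frac{1}{n}
\le \gamma \left( 1 + \int_{1}^{T} \frac{dx}{x} \right)
= \gamma \, (1 + \log T).
```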

C. The case of imperfect CSI 

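Proof of Theorem 2: As before, we begin with the basic inequality:

\[ \mathrm{Reg}(T) = \max_{X^* \in \mathcal{X}} \sum_{n=1}^{T} [u_n(X^*) - u_n(X_n)] \le \max_{X^* \in \mathcal{X}} \sum_{n=1}^{T} \operatorname{tr}[V_n \cdot (X^* - X_n)], \tag{48} \]

with $X_n$ defined via the (stochastic) recursion:

\[ X_{n+1} = \Pi(X_n + \gamma_n \hat{V}_n). \tag{49} \]

Thus, for the first part of the theorem, we need to show that:

\[ \sum_{n=1}^{T} \operatorname{tr}[\hat{V}_n \cdot (X^* - X_n)] + \sum_{n=1}^{T} \operatorname{tr}[Z_n \cdot (X_n - X^*)] = o(T) \quad \text{(a.s.)}, \tag{50} \]

where $Z_n = \hat{V}_n - V_n$. The first term of (50) then becomes:

\[ \begin{aligned} \sum_{n=1}^{T} \operatorname{tr}[\hat{V}_n \cdot (X^* - X_n)] &\le \gamma_1^{-1} \Omega + \sum_{n=2}^{T} \left( \gamma_n^{-1} - \gamma_{n-1}^{-1} \right) \Omega + \frac{1}{2} \sum_{n=1}^{T} \gamma_n \|\hat{V}_n\|^2 \\ &\le \frac{1}{\gamma_T} + V_0^2 \sum_{n=1}^{T} \gamma_n + \sum_{n=1}^{T} \gamma_n \|Z_n\|^2, \end{aligned} \tag{51} \]

where, as before, $\Omega = \tfrac{1}{2} \max_{X, X' \in \mathcal{X}} \|X - X'\|^2 = 1$, and the second line uses $\|\hat{V}_n\|^2 \le 2\|V_n\|^2 + 2\|Z_n\|^2$. We now claim that
$T^{-1} \sum_{n=1}^{T} \gamma_n \|Z_n\|^2 \to 0$ (a.s.) as $T \to \infty$. Indeed,
let $z_n = \|Z_n\|$ and choose $\varepsilon > 0$ such that $4\varepsilon < \alpha\beta - 2$ (recall that $\alpha\beta > 2$); Hypothesis (H2) then implies that
$\mathbb{P}(z_n \ge n^{\alpha/2 - \varepsilon/\beta}) \le B/n^{\alpha\beta/2 - \varepsilon}$ for all n, so we obtain:

\[ \sum_{n=1}^{\infty} \mathbb{P}(z_n \ge n^{\alpha/2 - \varepsilon/\beta}) \le \sum_{n=1}^{\infty} O(1/n^{1+\varepsilon}) < \infty, \tag{52} \]

and hence, by the Borel-Cantelli lemma, we conclude that $\mathbb{P}(z_n \ge n^{\alpha/2 - \varepsilon/\beta} \text{ for infinitely many } n) = 0$. In turn, this
implies that $z_n^2 = O(n^{\alpha - 2\varepsilon/\beta})$ almost surely, so we get:

\[ \sum_{n=1}^{T} \gamma_n \|Z_n\|^2 = O\Big( \sum_{n=1}^{T} n^{-\alpha} n^{\alpha - 2\varepsilon/\beta} \Big) = O\Big( \sum_{n=1}^{T} n^{-2\varepsilon/\beta} \Big) = o(T) \quad \text{(a.s.)}. \tag{53} \]

For the second term of (50), let $\xi_n = \operatorname{tr}[Z_n \cdot (X^* - X_n)]$. Then, given that $X_n$ is a deterministic function of $X_{n-1}$ and
$\hat{V}_{n-1}$, we will also have $\mathbb{E}[\xi_n \mid X_{n-1}] = 0$, i.e. $\xi_n$ is a sequence of martingale differences. By the strong law of large
numbers for martingale differences [54, Theorem 2.18], it then follows that $\lim_{T \to \infty} T^{-1} \sum_{n=1}^{T} \xi_n = 0$ (a.s.). As a result,
combining this with (53), we get

\[ \sum_{n=1}^{T} \operatorname{tr}[V_n \cdot (X^* - X_n)] = o(T) \quad \text{(a.s.)}, \tag{54} \]

i.e. (49) leads to no regret, as claimed.

Finally, for the mean regret bound (33), taking the expectation of the first line of (51) yields:

\[ \mathbb{E}[\mathrm{Reg}(T)] \le \gamma_1^{-1} \Omega + \sum_{n=2}^{T} \left( \gamma_n^{-1} - \gamma_{n-1}^{-1} \right) \Omega + \frac{1}{2} \sum_{n=1}^{T} \gamma_n \mathbb{E}[\|\hat{V}_n\|^2] \le \frac{1}{\gamma_T} + \frac{\hat{V}_0^2}{2} \sum_{n=1}^{T} \gamma_n, \tag{55} \]

where we used the fact that $\mathbb{E}[\operatorname{tr}[V_n \cdot (X^* - X_n)]] = \mathbb{E}[\operatorname{tr}[\hat{V}_n \cdot (X^* - X_n)]]$ for the LHS, and the assumption that
$\mathbb{E}[\|\hat{V}_n\|^2] \le \hat{V}_0^2$ for the RHS. ■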

Proof of Proposition 2: By reasoning as in the proof of Proposition 1, we readily obtain:

\[ \sum_{n=1}^{T} \operatorname{tr}[\hat{V}_n \cdot (X^* - X_n)] \le \frac{1}{2} \sum_{n=1}^{T} \gamma_n \|\hat{V}_n\|^2, \tag{56} \]

so (36) follows by taking expectations on both sides as in the proof of Theorem 2. That $X_n$ leads to no regret then
follows by noting that (53) holds even for $\alpha = 1$, so the RHS of (56) is o(T). ■